Reducing the fetch time of target instructions of a predicted taken branch instruction

ABSTRACT

A method and processor for reducing the fetch time of target instructions of a predicted taken branch instruction. Each entry in a buffer, referred to herein as a “branch target buffer”, may store an address of a branch instruction predicted taken and the instructions beginning at the target address of the branch instruction predicted taken. When an instruction is fetched from the instruction cache, a particular entry in the branch target buffer is indexed using particular bits of the fetched instruction. The address of the branch instruction in the indexed entry is compared with the address of the instruction fetched from the instruction cache. If there is a match, then the instructions beginning at the target address of that branch instruction are dispatched directly behind the branch instruction. In this manner, the fetch time of target instructions of a predicted taken branch instruction is reduced.

TECHNICAL FIELD

The present invention relates to the field of instruction execution in computers, and more particularly to reducing the fetch time of target instructions of a predicted taken branch instruction.

BACKGROUND INFORMATION

Program instructions for a microprocessor are typically stored in sequential, addressable locations within a memory. When these instructions are processed, the instructions may be fetched from consecutive memory locations and stored in a cache commonly referred to as an instruction cache. The instructions may later be retrieved from the instruction cache and executed. Each time an instruction is fetched from memory, a next instruction pointer within the microprocessor may be updated so that it contains the address of the next instruction in the sequence. The next instruction in the sequence may commonly be referred to as the next sequential instruction pointer. Sequential instruction fetching, updating of the next instruction pointer, and execution of sequential instructions may continue linearly until an instruction, commonly referred to as a branch instruction, is encountered and taken.

A branch instruction is an instruction which causes subsequent instructions to be fetched from one of at least two addresses: a sequential address identifying an instruction stream beginning with instructions which directly follow the branch instruction; or an address referred to as a “target address” which identifies an instruction stream beginning at an arbitrary location in memory. A branch instruction, referred to as an “unconditional branch instruction”, always branches to the target address, while a branch instruction, referred to as a “conditional branch instruction”, may select either the sequential or the target address based on the outcome of a prior instruction. It is noted that when the term “branch instruction” is used herein, the term “branch instruction” refers to a “conditional branch instruction”.

To efficiently execute instructions, microprocessors may implement a mechanism, commonly referred to as a branch prediction mechanism. A branch prediction mechanism determines a predicted direction (taken or not taken) for an encountered branch instruction, allowing subsequent instruction fetching to continue along the predicted instruction stream indicated by the branch prediction. For example, if the branch prediction mechanism predicts that the branch instruction will be taken, then the next instruction fetched is located at the target address. If the branch prediction mechanism predicts that the branch instruction will not be taken, then the next instruction fetched is sequential to the branch instruction.

If the predicted instruction stream is correct, then the number of instructions executed per clock cycle is advantageously increased. However, if the predicted instruction stream is incorrect, i.e., one or more branch instructions are predicted incorrectly, then the instructions from the incorrectly predicted instruction stream are discarded from the instruction processing pipeline and the other instruction stream is fetched. Therefore, the number of instructions executed per clock cycle is decreased.

A processor may include a fetch unit configured to fetch a group of instructions, referred to as a “fetch group.” The fetch group may be fetched from an instruction cache and, upon decoding, may be enqueued in an instruction queue for execution. Currently, upon enqueuing a fetch group containing a branch instruction that is predicted taken in the instruction queue, there is a delay, e.g., a two cycle lag, in enqueuing the subsequent instruction line (i.e., the branched instruction line) in the instruction queue to be executed. This delay results in dead-time in the pipeline where no instructions are executed, as illustrated in FIG. 1.

Referring to FIG. 1, FIG. 1 is a timing diagram illustrating that the instructions at the branch target address (branched fetch group) are enqueued in the instruction queue two cycles after the enqueuing of the fetch group containing a branch instruction. As illustrated in FIG. 1, a fetch group, a group of instructions, is fetched in two stages, which are indicated as IF1 and IF2. In the first stage, IF1 fetches fetch groups A, A+10, A+20, B, B+10, B+20, B+30, B+40, B+50, C, C+10 and C+20 in the indicated clock cycles. In the second stage, IF2 continues to fetch fetch groups A, A+10, B, B+10, B+20, B+30, B+40, C and C+10 in the indicated clock cycles.

At the decode stage, which is indicated as “DCD”, a branch instruction in the fetch group is determined as predicted taken or not taken. If the decode logic at the decode stage determines that the branch instruction in the fetch group is predicted taken, then the signal identified as “Br Predict Taken” goes high. Otherwise, the signal “Br Predict Taken” remains low. For example, referring to FIG. 1, the decode logic at the decode stage determined that the branch instructions in fetch groups A and B+30 were predicted taken.

In the stage following the decode stage, the instructions are enqueued in the instruction queue in the order to be executed. As illustrated in FIG. 1, fetch group A had a branch instruction that was predicted taken. Further, as illustrated in FIG. 1, the branch instruction branched to fetch group B. Hence, fetch group A was enqueued in the instruction queue followed by enqueuing fetch group B. However, there was a two cycle lag between the enqueuing of fetch group A and fetch group B. As stated above, this two cycle lag causes dead-time in the pipeline where no instructions are executed.

The two cycle lag as illustrated in FIG. 1 may be exacerbated as the frequency requirements of processors continue to grow. As the frequency requirements for processors continue to grow, i.e., increase in the number of cycles per second the processor operates, there is an increase in the number of clock cycles taken to fetch instructions into the processing pipeline. Hence, there may be an increase in the number of instructions between the top of the fetch pipeline (the point at which the initial instruction was fetched) and the point at which the branch prediction can be accomplished. As a result, there may be cases where all the instructions may be dispatched while waiting for a predicted taken branch to be accessed, i.e., waiting to fetch the instructions at the branch target address, from the cache or other memory device. This may result in more dead-time in the pipeline than illustrated in FIG. 1.

By reducing dead-time in the pipeline, i.e., reducing the delay in enqueuing instructions following the branch instruction predicted taken in the instruction queue, a greater number of instructions may be processed by a processor in a given period of time.

Therefore, there is a need in the art to reduce the fetch time of target instructions of a predicted taken branch instruction.

SUMMARY

The problems outlined above may at least in part be solved in some embodiments by storing in each entry of a buffer, referred to herein as a “branch target buffer”, an address of a branch instruction predicted taken and the instructions beginning at the target address of the branch instruction predicted taken. When an instruction is fetched from the instruction cache, a particular entry in the branch target buffer is indexed using particular bits of the fetched instruction. The address of the branch instruction in the indexed entry is compared with the address of the instruction fetched from the instruction cache. If there is a match and a branch prediction taken indication, the instructions beginning at the target address of that branch instruction are dispatched directly behind the branch instruction. The target instructions (instructions beginning at the target address of the branch instruction) are dispatched directly behind the branch instruction since these are known from the indexed entry in the branch target buffer. By dispatching the target instructions directly behind the branch instruction, the target instructions may be decoded by the decode logic in the clock cycle following the decoding of the branch instruction. The target instructions may then be enqueued in the instruction queue in the clock cycle following the enqueuing of the branch instruction predicted taken. In this manner, the fetch time of target instructions of a predicted taken branch instruction is reduced.

In one embodiment of the present invention, a method for reducing the fetch time of target instructions of a predicted taken branch instruction comprises the step of accessing an instruction cache to fetch an instruction. The method may further comprise indexing into an entry in a buffer using bits from the instruction fetched from the instruction cache. The buffer may comprise a plurality of entries where each of the plurality of entries comprises an address of a branch instruction, a plurality of instructions beginning at a target address of the branch instruction, prediction information for any of the plurality of instructions that are branch instructions and an address of a next fetch group. The method may further comprise comparing an address of the instruction fetched from the instruction cache with the address of the branch instruction in the indexed entry of the buffer. The method may further comprise selecting the plurality of instructions beginning at the target address of the branch instruction in the indexed entry of the buffer if the address of the instruction fetched from the instruction cache matches with the address of the branch instruction in the indexed entry of the buffer.

The foregoing has outlined rather generally the features and technical advantages of one or more embodiments of the present invention in order that the detailed description of the present invention that follows may be better understood. Additional features and advantages of the present invention will be described hereinafter which may form the subject of the claims of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

A better understanding of the present invention can be obtained when the following detailed description is considered in conjunction with the following drawings, in which:

FIG. 1 is a timing diagram illustrating that the instructions at the branch target address are enqueued in an instruction queue two clock cycles after enqueuing a fetch group containing the branch instruction;

FIG. 2 is a high-level diagram of a processor in accordance with an embodiment of the present invention;

FIG. 3 is an embodiment of the present invention of the processor containing a mechanism to reduce the fetch time of target instructions of a predicted taken branch instruction;

FIG. 4 is an embodiment of the present invention of an entry in the branch target buffer;

FIG. 5 is a timing diagram illustrating the reduction in the fetch time of target instructions of a predicted taken branch instruction in accordance with an embodiment of the present invention;

FIGS. 6A-B are a flowchart of a method for reducing the fetch time of target instructions of a predicted taken branch instruction in accordance with an embodiment of the present invention;

FIG. 7 is a flowchart of a method for updating the branch target buffer with instructions and prediction information stored in the branch target buffer queue in accordance with an embodiment of the present invention; and

FIG. 8 is a flowchart of a method for updating the branch target buffer and the branch history table with updated prediction information in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION

The present invention comprises a method and processor for reducing the fetch time of target instructions of a predicted taken branch instruction. In one embodiment of the present invention, each entry in a buffer, referred to herein as a “branch target buffer” (BTB), may store an address of a branch instruction predicted taken, the instructions beginning at the target address of the branch instruction predicted taken, branch prediction information and the next fetch address. When an instruction is fetched from the instruction cache, a particular entry in the branch target buffer is indexed using particular bits of the fetched instruction. The address of the branch instruction in the indexed entry is compared with the address of the instruction fetched from the instruction cache. If there is a match and a branch in the fetch group is predicted taken, then the instruction fetched from the instruction cache is considered to have a BTB hit. Further, if there is a BTB hit, the instructions from the branch target buffer beginning at the target address of that branch instruction are dispatched directly behind the branch instruction. The target instructions (instructions beginning at the target address of the branch instruction) are dispatched directly behind the branch instruction since these are accessed from the indexed entry in the branch target buffer. By dispatching the target instructions directly behind the branch instruction, the target instructions may be decoded by the decode logic in the clock cycle following the decoding of the branch instruction. The target instructions may then be enqueued in the instruction queue in the clock cycle following the enqueuing of the branch instruction predicted taken. Also, the subsequent cache line is directly fetched using the next fetch address stored in the branch target buffer. In this manner, the fetch time of target instructions of a predicted taken branch instruction is reduced.

In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. However, it will be apparent to those skilled in the art that the present invention may be practiced without such specific details. In other instances, well-known circuits have been shown in block diagram form in order not to obscure the present invention in unnecessary detail. For the most part, details concerning timing considerations and the like have been omitted inasmuch as such details are not necessary to obtain a complete understanding of the present invention and are within the skills of persons of ordinary skill in the relevant art.

FIG. 2—High-Level Diagram of Processor

FIG. 2 is a high-level diagram of the major components of a processor 200 including certain associated cache structures in accordance with an embodiment of the present invention. Also shown in FIG. 2 is a level-2 cache 201.

Referring to FIG. 2, processor 200 may include a level-1 instruction cache 202, an instruction unit 203, a decode/issue portion 204 of instruction unit 203, a branch unit 205, execution units 206 and 207, a load/store unit 208, General Purpose Registers (GPRs) 209 and 210, a level-1 data cache 211, and memory management units 212, 213 and 214. In general, instruction unit 203 obtains instructions from level-1 instruction cache 202, decodes the instructions via decode/issue unit 204 to determine the operations to perform, and resolves branch conditions via branch unit 205 to control program flow. Execution units 206 and 207 perform arithmetic and logical operations on data in GPRs 209 and 210, and load/store unit 208 loads data from or stores data to level-1 data cache 211. Level-2 cache 201 is generally larger than level-1 instruction cache 202 or level-1 data cache 211, providing data to level-1 instruction cache 202 and level-1 data cache 211. Level-2 cache 201 obtains data from a higher level cache or main memory through an external interface, such as the processor local bus shown in FIG. 2.

Caches at any level are logically an extension of main memory, unlike registers. However, some caches are typically packaged on the same integrated circuit chip as processor 200, and for this reason are sometimes considered a part of processor 200. In one embodiment, processor 200 along with certain cache structures are packaged in a single semiconductor chip, and for this reason processor 200 may be referred to as a “processor core” to distinguish it from the chip containing caches: level-1 instruction cache 202 and level-1 data cache 211. However, level-2 cache 201 may not be in the processor core although it may be packaged in the same semiconductor chip. The representation of FIG. 2 is intended to be typical, but is not intended to limit the present invention to any particular physical or logical cache implementation. It will be recognized that processor 200 and caches could be designed according to system requirements, and chips may be designed differently from what is represented in FIG. 2.

Referring to FIG. 2, memory management unit 212 may contain the addressing environments for programs. Memory Management Unit (MMU) 212 may be configured to translate/convert effective addresses (EAs) generated by instruction unit 203 or load/store unit 208 for instruction fetching and operand fetching. The instruction micro-TLB (ITLB) 213 is a mini MMU that copies a part of the MMU contents to improve instruction EA translation, and the data micro-TLB (DTLB) 214 is provided for operand EA translation. Both ITLB 213 and DTLB 214 are provided for MMU acceleration to improve processor performance. FIG. 2 is intended to be typical, but is not intended to limit the present invention to any particular physical or logical MMU implementation.

Instructions from level-1 instruction cache 202 are loaded into instruction unit 203 using ITLB 213 prior to execution. Decode/issue unit 204 selects one or more instructions to be dispatched/issued for execution and decodes the instructions to determine the operations to be performed or branch conditions to be resolved in branch unit 205.

Execution units 206 and 207 comprise a set of general purpose registers (GPRs) 209 and 210 for storing data and an arithmetic logic unit (ALU) for performing arithmetic and logical operations on data in GPRs 209 and 210 responsive to instructions decoded by decode/issue unit 204. Again, FIG. 2 is intended to be typical, but is not intended to limit the functional capability of execution units 206 and 207. Execution units 206 and 207 may include a floating point operations subunit and a special vector execution subunit. In addition to the components shown in FIG. 2, execution units 206 and 207 may include special purpose registers and counters, control registers and so forth. In particular, execution units 206 and 207 may include complex pipelines and controls.

Load/store unit 208 is a separate unit but is closely interconnected to execution units 206 and 207 to provide data transactions from/to data cache 211 to/from GPR 210. In one embodiment, execution unit 207 fetches data from GPR 210 to generate operand effective addresses (EAs) to be used by load/store unit 208 to read data from data cache 211, using DTLB 214 for EA to real address (RA) translation, or to write data into data cache 211, using DTLB 214 for its EA translation.

As stated in the Background Information section, there may be a multiple clock cycle lag between the enqueuing of a fetch group containing a branch instruction predicted taken and the enqueuing of the branched fetch group. This delay may be exacerbated as the frequency requirements of processors continue to grow. By reducing dead-time in the pipeline, i.e., reducing the delay in enqueuing instructions following the branch instruction predicted taken in the instruction queue, a greater number of instructions may be processed by a processor in a given period of time. Therefore, there is a need in the art to reduce the fetch time of target instructions of a predicted taken branch instruction. A processor configured with a mechanism to reduce the fetch time of target instructions of a predicted taken branch instruction is described below in association with FIG. 3.

FIG. 3—Processor with Mechanism for Reducing the Fetch Time of Target Instructions of a Predicted Taken Branch Instruction

FIG. 3 illustrates an embodiment of the present invention of processor 200 (FIG. 2) containing a mechanism to reduce the fetch time of target instructions of a predicted taken branch instruction.

Referring to FIG. 3, processor 200 includes an instruction cache 202 (FIG. 2) which is accessed in two stages, which are designated as instruction fetch IF1 and IF2. During the IF1 and IF2 stages, a fetch group, referring to a group of instructions, is fetched from instruction cache 202. Concurrently with the IF2 stage, a branch target buffer (“BTB”) 301 is accessed using designated bits, e.g., bits 23-26, of an instruction in the fetch group fetched from instruction cache 202. This process may be repeated for each instruction in the fetch group fetched from instruction cache 202. BTB 301 includes multiple entries, e.g., sixteen. An embodiment of the present invention of an entry in BTB 301 is illustrated in FIG. 4.
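By way of illustration only, the indexing step described above may be sketched in C as follows. This is a minimal sketch, not the claimed implementation: the helper names, the treatment of the fetch address as a 30-bit word address, and the IBM-style bit numbering (bit 0 is the most significant bit) are assumptions; the sixteen-entry size and the use of bits 23-26 follow the example above.

    #include <stdint.h>

    /* Extract bits [msb..lsb] of a 30-bit word address using the bit numbering
     * used in the description (bit 0 = most significant, bit 29 = least
     * significant).  This helper and the address representation are assumptions. */
    static inline uint32_t addr_bits(uint32_t addr, unsigned msb, unsigned lsb)
    {
        unsigned width = lsb - msb + 1;
        unsigned shift = 29u - lsb;
        return (addr >> shift) & ((1u << width) - 1u);
    }

    #define BTB_ENTRIES 16u   /* e.g., sixteen entries */

    /* Index BTB 301 with designated bits, e.g., bits 23-26, of the fetch address. */
    static inline uint32_t btb_index(uint32_t fetch_addr)
    {
        return addr_bits(fetch_addr, 23, 26) % BTB_ENTRIES;
    }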

FIG. 4 illustrates an embodiment of the present invention of an entry in BTB 301. Referring to FIG. 4, BTB entry 400 may include entries 401A-L. Entry 401A may store bits 0-22 and 27-29 of the branch instruction address. Entry 401B may store an address (bits 0-29) used to generate the next fetch group. Entry 401C may store one of four instructions, labeled “Instr0” (bits 0-31), along with its predecode information (“predecode 0”) (bits 0-6) and a valid bit (“V”). Similarly, entry 401D may store one of the four instructions, labeled “Instr1” (bits 0-31), along with its predecode information (“predecode 1”) (bits 0-6) and a valid bit (“V”). Entries 401E and 401F may store similar information for the instructions labeled “Instr2” and “Instr3”, respectively. Instructions Instr0, Instr1, Instr2 and Instr3 begin at the target address of the branch instruction whose address is stored in entry 401A. Entry 401G may store a copy of the information stored in a global history register (“GHR”) 306 (bits 0-5), discussed further below. Such information may be stored in entry 401G in order to ensure that the global history value in BTB 301 keeps close track of the information stored in GHR 306. Further, entry 401G may be updated whenever its accompanying prediction bits are updated, as discussed further below. Entry 401G is updated along with entry 401H (bits 0-1). Entry 401H may be configured to store prediction information (“shared prediction information”) that may be used instead of the prediction information stored in entries 401I-L when the GHR value stored in entry 401G matches the value stored in GHR 306, as discussed below. Otherwise, the prediction information stored in one of the entries 401I-L (each with bits 0-1) may be used, as discussed below. Entry 401I may store prediction information for Instr0 if Instr0 is a branch instruction. Similarly, entries 401J-L may store prediction information for Instr1, Instr2 and Instr3, respectively, if Instr1, Instr2 or Instr3, respectively, is a branch instruction.
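The entry layout of FIG. 4 may be summarized with the following data structure. The C representation and the field names are illustrative assumptions; the field widths follow the description of entries 401A-L above.

    #include <stdbool.h>
    #include <stdint.h>

    #define INSTRS_PER_ENTRY 4   /* Instr0-Instr3 */

    /* One instruction slot (entries 401C-401F). */
    struct btb_instr_slot {
        uint32_t instr;       /* instruction word, bits 0-31     */
        uint8_t  predecode;   /* predecode information, bits 0-6 */
        bool     valid;       /* valid bit "V"                   */
    };

    /* One BTB entry (BTB entry 400, fields 401A-401L). */
    struct btb_entry {
        uint32_t branch_addr_tag;                        /* 401A: bits 0-22, 27-29 of branch address     */
        uint32_t next_fetch_addr;                        /* 401B: address used to form next fetch group  */
        struct btb_instr_slot target[INSTRS_PER_ENTRY];  /* 401C-F: instructions at the target address   */
        uint8_t  ghr_copy;                               /* 401G: copy of GHR 306, bits 0-5              */
        uint8_t  shared_pred;                            /* 401H: shared prediction bits, bits 0-1       */
        uint8_t  pred[INSTRS_PER_ENTRY];                 /* 401I-L: per-instruction prediction, bits 0-1 */
    };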

Returning to FIG. 3, processor 200 may further include a comparator 302 configured to compare the address of the instruction fetched from instruction cache 202, e.g., bits 0-22 and 27-29 of the fetched instruction address, with the branch instruction address in the indexed entry of BTB 301, e.g., bits 0-22 and 27-29. The result indicates whether the address fetched from instruction cache 202 matches the branch address in the indexed entry of BTB 301. When that occurs and the branch is predicted taken, a “BTB hit” is said to occur.
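A corresponding sketch of the hit determination performed by comparator 302, building on the helper and the entry structure sketched above, follows; the way the compared bits are packed into a single tag value is an assumption, since the description only names the bit positions.

    /* Concatenate bits 0-22 and 27-29 (using the numbering above) into one tag. */
    static inline uint32_t branch_addr_tag(uint32_t addr)
    {
        return (addr_bits(addr, 0, 22) << 3) | addr_bits(addr, 27, 29);
    }

    /* BTB hit: the tag matches and the branch is predicted taken. */
    static inline bool btb_hit(const struct btb_entry *e,
                               uint32_t fetch_addr,
                               bool branch_predicted_taken)
    {
        return (e->branch_addr_tag == branch_addr_tag(fetch_addr))
            && branch_predicted_taken;
    }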

Processor 200 further includes a selection mechanism 303, e.g., a multiplexer, that receives as inputs the plurality of instructions, e.g., four instructions, located in the indexed entry in BTB 301, as well as the same number of instructions, e.g., four instructions, that are located at the target address of the branch instruction predicted taken that was fetched from instruction cache 202. For example, if a fetch group fetched from instruction cache 202 includes a branch instruction predicted taken, then a fetch unit (not shown) would fetch the fetch group, e.g., four instructions, located at the target address of the branch instruction predicted taken. These four instructions may be fetched from instruction cache 202 and inputted to selection mechanism 303. Furthermore, the four instructions located in the indexed entry in BTB 301 may be inputted to selection mechanism 303. Based on whether there is a BTB hit, selection mechanism 303 would select either the plurality of instructions located in the indexed entry in BTB 301 or the plurality of instructions fetched by the fetch unit (not shown) located at the target address of the branch instruction predicted taken, or sequentially from instruction cache 202 if there were no predicted taken branches. If there is a BTB hit, then selection mechanism 303 selects the plurality of instructions located in the indexed entry in BTB 301. Otherwise, selection mechanism 303 selects the instructions fetched from instruction cache 202 by the fetch unit (not shown) located at the target address of the branch instruction predicted taken or the subsequent fetched cache line.
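The behavior of selection mechanism 303 may be sketched as follows, again building on the structures above; the array-of-four representation of a fetch group is an assumption made for illustration.

    /* A fetch group as a group of four instruction words (illustrative). */
    struct fetch_group {
        uint32_t instr[INSTRS_PER_ENTRY];
    };

    /* Instruction selection corresponding to selection mechanism 303. */
    static struct fetch_group select_instructions(bool hit,
                                                  const struct btb_entry *e,
                                                  struct fetch_group from_icache)
    {
        if (hit) {
            /* BTB hit: take the target instructions stored in the indexed entry. */
            struct fetch_group g;
            for (int i = 0; i < INSTRS_PER_ENTRY; i++)
                g.instr[i] = e->target[i].instr;
            return g;
        }
        /* No hit: use the instructions fetched from instruction cache 202. */
        return from_icache;
    }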

The output of selection mechanism 303 is inputted to decode logic unit 204 (FIG. 2), which is configured to determine if any of the instructions inputted to decode logic unit 204 are branch instructions. As illustrated in FIG. 3, the output of selection mechanism 303 is four words, each having bits 0-31. These four words are stored in four registers along with predecode information indicated by “pdcd”. Decode logic unit 204 may further store the effective address of these four words, indicated by “DCD_EA”.

Processor 200 further includes a branch history table 305 (“BHT”) configured to store prediction information which is used to predict a branch instruction as taken or not taken. Branch history table 305 includes a plurality of entries where each entry stores particular prediction information. Branch history table 305 may be indexed using bits, e.g., bits 17-22, from an instruction fetched during the IF2 stage as well as the bits, e.g., bits 0-5, stored in a global history register (“GHR”) 306. Global history register 306 may contain six bits of branch history for the last six fetch groups that contained branches. If a branch is predicted “branch taken”, then a “1” will be shifted into global history register 306. Otherwise, if a branch is predicted “not taken”, then a “0” will be shifted into global history register 306.
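A sketch of how branch history table 305 might be indexed and how global history register 306 is maintained follows, building on the helper above. The description states only which bits are used; the gshare-style XOR combination, the implied 64-entry table size, and the 2-bit counter storage are assumptions made purely for illustration.

    #define BHT_ENTRIES 64u   /* implied by a 6-bit index; an assumption */
    #define GHR_BITS    6u

    static uint8_t bht[BHT_ENTRIES];   /* per-entry prediction bits, e.g., 2-bit counters */
    static uint8_t ghr;                /* global history register 306, bits 0-5           */

    /* Index BHT 305 from address bits 17-22 and the GHR contents.  Whether the
     * two 6-bit values are concatenated or hashed is not specified in the text;
     * a gshare-style XOR is assumed here. */
    static inline uint32_t bht_index(uint32_t fetch_addr)
    {
        return (addr_bits(fetch_addr, 17, 22) ^ ghr) % BHT_ENTRIES;
    }

    /* Shift a 1 into GHR 306 for a predicted taken branch, a 0 otherwise. */
    static inline void ghr_update(bool predicted_taken)
    {
        ghr = (uint8_t)(((ghr << 1) | (predicted_taken ? 1u : 0u))
                        & ((1u << GHR_BITS) - 1u));
    }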

The prediction information from the indexed entry of branch history table 305 may be inputted to a selection mechanism 307, e.g., a multiplexer. Selection mechanism 307 may also receive the prediction information from the indexed entry in BTB 301. If there is a BTB hit, then selection mechanism 307 selects the prediction information from the indexed entry in BTB 301. By storing the prediction information in the indexed entry in BTB 301, accurate branch prediction can occur on BTB stored branch instructions. That is, accurate branch prediction can occur on any of the target instructions stored in BTB 301 that happen to be branch instructions. To further improve the branch prediction accuracy of those branches in BTB 301, a set of shared (common) prediction bits are stored in entry 401H (FIG. 4) along with a corresponding GHR value stored in entry 401G (FIG. 4). When the GHR value stored in entry 401G matches the content of GHR 306, the shared prediction bits may be used instead of the accompanying prediction bits of the instruction. Otherwise, such prediction information would have to be accessed from branch history table 305, which may result in several extra cycles of delay. Furthermore, selection mechanism 307 selects the prediction information from branch history table 305 if there is not a BTB hit.
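Selection mechanism 307 and the use of the shared prediction bits of entries 401G/401H may then be sketched as follows, building on the types above; the slot parameter identifying which of Instr0-Instr3 is the branch in question is an illustrative assumption.

    /* Prediction selection corresponding to selection mechanism 307. */
    static uint8_t select_prediction(bool hit,
                                     const struct btb_entry *e,
                                     int slot,
                                     uint8_t bht_prediction)
    {
        if (!hit)
            return bht_prediction;   /* no BTB hit: use branch history table 305 */

        if (e->ghr_copy == ghr)
            return e->shared_pred;   /* 401G matches GHR 306: use shared bits 401H */

        return e->pred[slot];        /* otherwise: per-instruction bits 401I-L */
    }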

This prediction information may be used by decode logic unit 204, which determines whether any of the instructions, e.g., four instructions, selected by selection mechanism 303 were predicted taken. As illustrated in FIG. 3, decode logic unit 204 may include registers storing bits associated with each instruction selected by selection mechanism 303 that indicate whether the associated instruction is a branch instruction predicted taken.

Processor 200 further includes a selection logic unit 308 coupled to decode logic unit 204 and to a selection mechanism 309, discussed below, that is coupled to decode logic unit 204. Selection logic unit 308 may be configured to send a signal to selection mechanism 309 to output the address of the first instruction, out of the plurality of instructions received by decode logic unit 204, that is a branch instruction predicted taken. If none of the instructions received by decode logic unit 204 are determined to be branch instructions by decode logic unit 204, or if none of the instructions received by decode logic unit 204 that are determined to be branch instructions are predicted taken, then there is no branch redirection and the next sequential address and instructions from IF2 and instruction cache 202 are loaded into decode logic unit 204. The address and instructions from the decode stage selected by selection mechanism 309 and selection mechanism 312 (described below) are moved to the appropriate register (labeled IF1-A 310 and IF1-B 311) of the address register and later added by adder 313 prior to being stored in an instruction queue (not shown). IF1-A 310 may be used to store the address of the branch instruction, whereas IF1-B 311 may be used to store the displacement of the branch instruction. By storing the instructions, e.g., four instructions, at the target address of the fetched branch instruction in BTB 301, these instructions may be dispatched and executed directly behind the branch instruction. Hence, by already having these instructions ready to be dispatched and executed, the cycle penalty (dead-time in the pipeline as illustrated in FIG. 1) discussed in the Background Information section is eliminated, as illustrated in FIG. 5.
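The role of selection logic unit 308 in picking the first predicted taken branch out of the decode group may be sketched as follows; the per-instruction flag arrays stand in for the bits decode logic unit 204 keeps, and the -1 return value meaning "no redirection" is an illustrative convention.

    /* Find the first instruction in the decode group that is a branch
     * predicted taken; return its slot, or -1 if fetch continues sequentially. */
    static int first_taken_branch(const bool is_branch[INSTRS_PER_ENTRY],
                                  const bool predicted_taken[INSTRS_PER_ENTRY])
    {
        for (int i = 0; i < INSTRS_PER_ENTRY; i++) {
            if (is_branch[i] && predicted_taken[i])
                return i;   /* redirect fetch using this branch */
        }
        return -1;          /* no predicted taken branch: no branch redirection */
    }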

Referring to FIG. 5, in conjunction with FIG. 3, FIG. 5 is a timing diagram illustrating that the instructions at the branch target address (branched fetch group) are enqueued in an instruction queue in the clock cycle following the enqueuing of the fetch group containing a branch instruction. As illustrated in FIG. 5, a fetch group is fetched in two stages, which are indicated as IF1 and IF2. In the first stage, IF1 fetches fetch groups A, A+10, A+20, B+10, B+20 and B+30 in the indicated clock cycles. In the second stage, IF2 continues to fetch fetch groups A, A+10, B+10, B+20 and B+30 in the indicated clock cycles.

At the decode stage, which is indicated as “DCD”, a branch instruction in the fetch group is determined as predicted taken or not taken. If the decode logic at the decode stage determines that the branch instruction in the fetch group is predicted taken, then the signal identified as “Br Predict Taken” goes high. Otherwise, the signal “Br Predict Taken” goes low. For example, referring to FIG. 5, the decode logic at the decode stage determined that the branch instruction in fetch group A was predicted taken.

In the stage following the decode stage, the instructions are enqueued in the instruction queue in the order to be executed. As illustrated in FIG. 5, fetch group A had a branch instruction that was predicted taken. Further, as illustrated in FIG. 5, the branch instruction branched to fetch group B.

FIG. 5 further illustrates the comparing of the address in the indexed entry of BTB 301 (indicated by “BTB Addr”) with the address of the instruction fetched from instruction cache 202 (indicated by “BTB Cmp”). If the address fetched from instruction cache 202 matches the branch address in the indexed entry of BTB 301, then a BTB hit occurs, which is indicated by the activation of the signal designated as “BTB Hit”. As illustrated in FIG. 5, BTB 301 stores the branch instruction address of the branch instruction in fetch group A. BTB 301 further stores the instructions beginning at a target address of the branch instruction in fetch group A (indicated by fetch group B) as well as the address of the next fetch group (indicated by fetch group B+10).

Referring to FIG. 5, in conjunction with FIG. 3, since BTB 301 stores the instructions beginning at a target address of the branch instruction and the address of the next fetch group, decode logic unit 204 determined that the branch instruction predicted taken in fetch group A branches to fetch group B in the next clock cycle. By storing the instructions beginning at a target address of the branch instruction, the two cycle delay penalty as illustrated in FIG. 1 is eliminated. Further, in the next clock cycle, the instructions at the address of the next fetch group (B+10) are fetched in the IF1 stage.

Returning to FIG. 3, processor 200 further includes another selection mechanism 312, e.g., a multiplexer, that receives as inputs the address of the next fetch group from the indexed entry of BTB 301 and the effective address of the branch instruction fetched from instruction cache 202. Selection mechanism 312 selects the address of the next fetch group from the indexed entry of BTB 301 to be outputted if there was a BTB hit. If there is a branch predicted taken but there is no BTB hit, then selection mechanism 312 computes the address of the next fetch group by adding the received effective address of the branch instruction with the displacement in the branch instruction. The outputted address is then inputted into IF1-B 311. An adder 313 adds the address stored in IF1-A 310 with the address stored in IF1-B 311 to form the address to be fetched in the subsequent clock cycle in the IF1 stage.
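The next fetch address formation performed by selection mechanism 312 may be sketched as follows, building on the entry structure above; the signed displacement representation is an assumption, since the exact address arithmetic is not spelled out in the text.

    /* Next fetch address corresponding to selection mechanism 312. */
    static uint32_t next_fetch_address(bool hit,
                                       const struct btb_entry *e,
                                       uint32_t branch_ea,
                                       int32_t branch_displacement)
    {
        if (hit)
            return e->next_fetch_addr;                       /* address stored in BTB 301 */
        return (uint32_t)(branch_ea + branch_displacement);  /* branch EA + displacement  */
    }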

Processor 200 further includes a BTB queue 314 coupled to a BTB reload unit 315, which is coupled to BTB 301. BTB queue 314 may be configured to store the instructions located at the target address of the branch instruction fetched from instruction cache 202. BTB queue 314 may further be configured to store prediction information selected from the indexed entry in branch history table 305.

The information stored in BTB queue 314 may be written to BTB 301 by BTB reload unit 315 if there was not a BTB hit and if the branch instruction fetched from instruction cache 202 by IF1 and IF2 was determined to be taken. As stated above, comparator 302 determines if there was a BTB hit, and its output is inputted to BTB reload unit 315. Further, BTB reload unit 315 receives a signal (indicated by “actual taken branch”) indicating if the branch instructions predicted taken were actually taken. This signal may be produced towards the end of the branch execution pipeline. A method of updating BTB 301 with instructions and prediction information stored in BTB queue 314 is provided further below in association with FIG. 7.
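The reload decision made by BTB reload unit 315 may be sketched as follows, building on the sketches above; representing the contents of BTB queue 314 as a ready-made entry is a simplification for illustration.

    /* BTB reload decision corresponding to BTB reload unit 315. */
    static void btb_reload(struct btb_entry btb[BTB_ENTRIES],
                           uint32_t branch_addr,
                           const struct btb_entry *queued,  /* from BTB queue 314 */
                           bool was_btb_hit,
                           bool actually_taken)
    {
        /* Write the queued information only when the branch missed in the BTB
         * and was actually taken at the end of the branch execution pipeline. */
        if (!was_btb_hit && actually_taken)
            btb[btb_index(branch_addr)] = *queued;
    }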

Furthermore, processor 200 includes a logic unit 316 configured to determine if the prediction bits stored in BTB 301 and in branch history table 305 need to be updated. This logic unit may be referred to as the “prediction status update unit.” Prediction status update unit 316 may receive prediction bits that have been updated. These updated prediction bits may be the prediction bits in the indexed entry of BTB 301 that need to be updated. Prediction status update unit 316 may be configured to store such updated prediction bits in BTB queue 314.

If BTB queue 314 stores such updated prediction bits, then BTB reload unit 315 may update such prediction bits in the indexed entry in BTB 301 and in the indexed entry in branch history table 305. The prediction bits are updated whenever it has been determined that the prediction bits in BTB 301 are incorrect, e.g., a branch from a BTB hit is predicted taken in the decode stage and then the branch is determined to be not taken in the execute stage. The prediction needs to be updated in BTB 301 so that the next time the branch is accessed from BTB 301 it will be predicted not taken. A method of updating prediction information in BTB 301 and in branch history table 305 is provided further below in association with FIG. 8.
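Finally, the prediction update path through prediction status update unit 316 and BTB reload unit 315 may be sketched as follows, building on the sketches above. Writing the same corrected bits into both the BTB entry and branch history table 305, and refreshing the GHR copy in entry 401G along with the shared bits in entry 401H, follow the description; the 2-bit encoding and the use of the current GHR value at update time are assumptions.

    /* Prediction update on a mispredicted BTB-hit branch. */
    static void update_prediction(struct btb_entry btb[BTB_ENTRIES],
                                  uint32_t branch_addr,
                                  int slot,               /* which of Instr0-Instr3 */
                                  uint8_t new_pred_bits)  /* corrected prediction   */
    {
        struct btb_entry *e = &btb[btb_index(branch_addr)];
        e->pred[slot]  = new_pred_bits;    /* update indexed BTB entry (401I-L)  */
        e->shared_pred = new_pred_bits;    /* update shared bits 401H...         */
        e->ghr_copy    = ghr;              /* ...along with the GHR copy in 401G */
        bht[bht_index(branch_addr)] = new_pred_bits;   /* and branch history table 305 */
    }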

A description of a method of reducing the fetch time of target instructions of a predicted taken branch instruction using the mechanism of FIG. 3 is provided below in association with FIGS. 6A-B.

FIGS. 6A-B—Method of Reducing the Fetch Time of Target Instructions of a Predicted Taken Branch Instruction

FIGS. 6A-B are a flowchart of one embodiment of the present invention of a method 600 for reducing the fetch time of target instructions of a predicted taken branch instruction.

Referring to FIG. 6A, in conjunction with FIGS. 2-3, in step 601, instruction cache 202 is accessed by a fetch unit (not shown) to fetch a group of instructions (“fetch group”) in two stages indicated by IF1 and IF2.

In step 602, branch history table 305 is accessed during the fetch stages IF1 and IF2. In step 603, an entry in branch history table 305 is indexed using bits, e.g., bits 17-22, from the instruction fetched during the IF2 stage as well as the bits, e.g., bits 0-5, stored in global history register 306. The indexed entry may contain prediction information.

In step 604, branch target buffer 301 is accessed during the fetch stage IF2. In step 605, an entry in branch target buffer 301 is indexed using designated bits, e.g., bits 23-26, of the first instruction in the fetch group fetched from instruction cache 202. The indexed entry includes an address of a branch instruction predicted taken, a plurality of instructions, e.g., four instructions, beginning at a target address of the branch instruction, prediction information for any of the plurality of instructions that are branch instructions and an address of the next fetch group.

Upon execution of steps 603 and 605, a determination is made in step 606 as to whether there was a “BTB hit”. That is, in step 606, a determination is made as to whether the address fetched from instruction cache 202 matches the branch address in the indexed entry of BTB 301. When that occurs and the branch is predicted taken, a BTB hit is said to occur.

If there is not a BTB hit, then, in step 607, instructions retrieved from accessing instruction cache 202 are selected by selection mechanism 303, as discussed above. In step 608, selection mechanism 307 selects the prediction information obtained from branch history table 305, as discussed above.

Further, if there is not a BTB hit, then, in step 609, selection mechanism 312 selects the effective address of the branch instruction to be used to calculate the target address, as discussed above. In step 610, the next instruction sequence at the target address of the branch instruction is fetched in the next clock cycle.

If, however, there is a BTB hit, then, in step 611, selection mechanism 303 selects the instructions obtained from the indexed entry of branch target buffer 301 in step 605. Further, in step 611, selection mechanism 307 selects the prediction information obtained from the indexed entry of branch target buffer 301 in step 605.

Upon selecting instructions and prediction information from the indexed entry of branch target buffer 301, or upon selecting the instructions from instruction cache 202 and selecting the prediction information from the indexed entry of branch history table 305, a determination is made by decode logic unit 204 in step 612 as to whether any of the instructions selected in steps 611 or 607 are branch instructions.

If none of these instructions are branch instructions, then instructions retrieved from accessing instruction cache 202 are selected by selection mechanism 303, as discussed above, in step 607.

Referring to FIG. 6B, in conjunction with FIGS. 2-3, if, however, one or more of these instructions are branch instructions, then, in step 613, a determination is made by decode logic unit 204 as to whether the branch instruction is predicted taken. If none of the branch instruction(s) are predicted taken, then instructions retrieved from accessing instruction cache 202 are selected by selection mechanism 303, as discussed above, in step 607.

If, however, there is a branch instruction predicted taken, then, in step 614, selection mechanism 309 loads a displacement of the first branch instruction predicted taken in IF1-A 310 and loads an address of the first branch instruction predicted taken in IF1-B 311. In step 615, the instruction sequence at the target address of the branch instruction predicted taken is fetched in the same clock cycle, as illustrated in FIG. 5.

It is noted that method 600 may include other and/or additional steps that, for clarity, are not depicted. It is further noted that method 600 may be executed in a different order than presented and that the order presented in the discussion of FIGS. 6A-B is illustrative. It is further noted that certain steps in method 600, e.g., steps 602-605 and steps 607-609, may be executed in a substantially simultaneous manner.

As stated above, a description of a method of updating BTB 301, such as by updating BTB 301 with the instructions and prediction information stored in BTB queue 314, is provided below in association with FIG. 7.

FIG. 7—Method of Updating Branch Target Buffer

FIG. 7 is a flowchart of one embodiment of the present invention of a method 700 of updating BTB 301 (FIG. 3) with the instructions and prediction information stored in BTB queue 314 (FIG. 3).

Referring to FIG. 7, in conjunction with FIGS. 2-3, in step 701, the instructions and prediction information are loaded in branch target buffer (BTB) queue 314 when the fetch group containing the branch instruction predicted taken enters the decode stage and there was not a BTB hit.

In step 702, a determination is made by BTB reload unit 315 as to whether the branch instruction fetched from instruction cache 202 was actually taken. BTB reload unit 315 may receive a signal indicating whether the branch instruction predicted taken was actually taken at the time the branch instruction is executed, as described above. If the branch instruction fetched from instruction cache 202 was not actually taken, then BTB queue 314 is flushed in step 703.

If, however, the branch instruction fetched from instruction cache 202 was actually taken, then, in step 704, the instructions and prediction information stored in BTB queue 314 are written to BTB 301. Upon writing the instructions and prediction information stored in BTB queue 314 to BTB 301, BTB queue 314 is flushed in step 703.

It is noted that method 700 may include other and/or additional steps that, for clarity, are not depicted. It is further noted that method 700 may be executed in a different order than presented and that the order presented in the discussion of FIG. 7 is illustrative. It is further noted that certain steps in method 700 may be executed in a substantially simultaneous manner.

As stated above, a description of a method of updating prediction information in BTB 301 and in branch history table 305 is provided further below in association with FIG. 8.

FIG. 8—Method for Updating Prediction Information

FIG. 8 is a flowchart of one embodiment of the present invention of a method 800 of updating the prediction information stored in BTB 301 (FIG. 3) and branch history table 305 (FIG. 3).

Referring to FIG. 8, in conjunction with FIGS. 2-3, in step 801, the branch instruction fetched from instruction cache 202 completes execution. In step 802, a determination is made by comparator 302 as to whether the executed branch instruction was a BTB hit.

If the executed branch instruction was not a BTB hit, then the next branch instruction fetched from instruction cache 202 completes execution in step 801.

If, however, the executed branch instruction was a BTB hit, then, in step 803, a determination is made by prediction status update unit 316 as to whether the prediction bits in BTB 301 and branch history table 305 need to be updated. If prediction status update unit 316 determines that the prediction bits do not need to be updated (an explanation of how prediction status update unit 316 determines whether the prediction bits need to be updated is provided above), then, in step 804, BTB 301 and branch history table 305 are not updated. If, however, prediction status update unit 316 determines that the prediction bits need to be updated, then, in step 805, prediction status update unit 316 determines if the prediction is correct.

If the prediction is correct, then BTB 301 and branch history table 305 are not updated in step 804. If, however, the prediction is incorrect, then, in step 806, prediction status update unit 316 loads the updated prediction bits in BTB queue 314. In step 807, BTB reload unit 315 updates the appropriate prediction bits in the indexed entry (the entry indexed in step 605 of FIG. 6A) in BTB 301. In step 808, BTB reload unit 315 updates the same prediction bits in the indexed entry (the entry indexed in step 603 of FIG. 6A) in branch history table 305.

It is noted that method 800 may include other and/or additional steps that, for clarity, are not depicted. It is further noted that method 800 may be executed in a different order than presented and that the order presented in the discussion of FIG. 8 is illustrative. It is further noted that certain steps in method 800 may be executed in a substantially simultaneous manner.

Although the method and processor are described in connection with several embodiments, it is not intended that they be limited to the specific forms set forth herein, but on the contrary, it is intended to cover such alternatives, modifications and equivalents as can be reasonably included within the spirit and scope of the invention as defined by the appended claims. It is noted that the headings are used only for organizational purposes and are not meant to limit the scope of the description or claims.

CLAIMS

1. A method for reducing the normal fetch time of a predicted taken branch instruction comprising the steps of: accessing an instruction cache to fetch an instruction; indexing into an entry in a buffer using bits from said instruction fetched from said instruction cache, wherein said buffer comprises a plurality of entries, wherein each of said plurality of entries comprises an address of a branch instruction, a plurality of instructions beginning at a target address of said branch instruction, prediction information for any of said plurality of instructions that are branch instructions and an address of a next fetch group; comparing an address of said instruction fetched from said instruction cache with said address of said branch instruction in said indexed entry of said buffer; and selecting said plurality of instructions beginning at said target address of said branch instruction in said indexed entry of said buffer if said address of said instruction fetched from said instruction cache matches with said address of said branch instruction in said indexed entry of said buffer.
2. The method as recited in claim 1 further comprising the step of: selecting said instruction retrieved from accessing said instruction cache if none of said plurality of instructions selected from said indexed entry of said buffer is a branch instruction.
3. The method as recited in claim 2 further comprising the step of: selecting prediction information obtained from a branch history table if none of said plurality of instructions selected from said indexed entry of said buffer is a branch instruction.
4. The method as recited in claim 1 further comprising the step of: selecting said prediction information from said indexed entry of said buffer if said address of said instruction fetched from said instruction cache matches with said address of said branch instruction in said indexed entry of said buffer.
5. The method as recited in claim 1 further comprising the step of: determining if any of said plurality of instructions selected is a branch instruction.

6-7. (canceled)
8. The method as recited in claim 1 further comprising the steps of: loading a buffer queue coupled to said buffer with instructions selected from one of said instruction cache and said indexed entry in said buffer; loading said buffer queue coupled to said buffer with prediction information selected from one of a branch history table and said indexed entry in said buffer; and writing said instructions and prediction information stored in said buffer queue to said buffer if said instruction fetched from said instruction cache was a branch instruction actually taken.

9. A processor, comprising: an instruction cache configured to store instructions, wherein an instruction is fetched from said instruction cache; a buffer, wherein said buffer comprises a plurality of entries, wherein each of said plurality of entries in said buffer comprises an address of a branch instruction, a plurality of instructions beginning at a target address of said branch instruction, prediction information for any of said plurality of instructions that are branch instructions and an address of a next fetch group, wherein an entry of said plurality of entries in said buffer is indexed; and a first selection mechanism coupled to said instruction cache and said buffer, wherein said selection mechanism is configured to select said plurality of instructions beginning at said target address of said branch instruction in said indexed entry if an address of said instruction fetched from said instruction cache matches with said address of said branch instruction in said indexed entry of said buffer.
10. The processor as recited in claim 9, wherein said entry in said buffer is indexed using bits from said instruction fetched from said instruction cache.
11. The processor as recited in claim 9 further comprising: a second selection mechanism coupled to said first selection mechanism, wherein said second selection mechanism is configured to select an address of one of said plurality of instructions selected to be loaded into an instruction queue.
12. The processor as recited in claim 9 further comprising: a third selection mechanism coupled to said buffer and to a branch history table, wherein said third selection mechanism is configured to select said prediction information from said indexed entry of said buffer if said address of said instruction fetched from said instruction cache matches with said address of said branch instruction in said indexed entry of said buffer.
13. The processor as recited in claim 12 further comprising: a decode logic unit coupled to said first selection mechanism, wherein said decode logic unit is configured to determine if any of said plurality of instructions selected is a branch instruction if said address of said instruction fetched from said instruction cache matches with said address of said branch instruction in said indexed entry of said buffer.
14. The processor as recited in claim 13 further comprising: a selection logic unit coupled to said third selection mechanism, wherein said selection logic unit is configured to select a first of said plurality of instructions selected.
15. The processor as recited in claim 14, wherein said first of said plurality of instructions selected is a branch instruction predicted taken.
16. The processor as recited in claim 15 further comprising: a buffer reload unit coupled to a buffer queue, said buffer and to said branch history table, wherein said buffer reload unit is configured to update said prediction information stored in said indexed entry in said buffer if a prediction of said first of said plurality of instructions selected is incorrect, wherein said buffer reload unit is further configured to update said prediction information stored in said branch history table if said prediction of said first of said plurality of instructions selected is incorrect.
17. The processor as recited in claim 9 further comprising: a buffer queue coupled to said buffer, wherein said buffer queue is configured to store instructions selected from said instruction cache, wherein said buffer queue is further configured to store prediction information selected from a branch history table.
18. The processor as recited in claim 17 further comprising: a buffer reload unit coupled to said buffer queue and to said buffer, wherein said buffer reload unit is configured to write said instructions and prediction information stored in said buffer queue to said buffer if said instruction fetched from said instruction cache was a branch instruction actually taken.